[Feature][Perf] Support Selective CPU Weight Offloading #34535
vllm-bot merged 4 commits into vllm-project:main
Conversation
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Code Review
This pull request introduces a useful feature for selectively offloading model parameters to the CPU, which can significantly improve performance in memory-constrained scenarios, as demonstrated by the provided benchmarks. The implementation is clear and follows existing patterns in the codebase. The changes to the configuration and model loading logic are well-integrated. The parameter name matching logic, while a bit subtle, appears correct and robust for its intended purpose. Overall, this is a solid contribution that enhances the flexibility and performance of vLLM.
@wzhao18 Is it possible to use regex syntax like in, e.g.,
@ehfd Can you share the motivation? I did not go with regex because I want to keep it as simple as possible. If you find cases that cannot be expressed the current way, maybe we should consider supporting regex.
@ehfd Different MoE models name their parameters differently, so relying on fixed regex patterns to identify MoE expert weights may not work. That said, since the naming convention is consistent across layers, I think the current approach is expressive enough for offloading any specific model weights. For any model on Hugging Face, you can check the index file for the weight names - e.g. https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/model.safetensors.index.json
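As a concrete illustration of the name-matching idea, here is a hedged sketch (the `should_offload` helper and plain-substring pattern semantics are illustrative assumptions, not vLLM's actual implementation): each parameter name from the checkpoint index is checked against the user-supplied patterns.

```python
# Illustrative sketch of selective offload by parameter-name matching.
# The pattern semantics (plain substring match) are an assumption for
# illustration, not necessarily what --cpu-offload-params implements.

def should_offload(param_name: str, patterns: list[str]) -> bool:
    """Return True if any user-supplied pattern occurs in the name."""
    return any(p in param_name for p in patterns)

# Weight names in the style of a model.safetensors.index.json file.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.experts.7.down_proj.weight",
    "model.layers.0.mlp.gate.weight",
]

# With the pattern "experts", only the MoE expert weight is selected
# for CPU offloading; attention and router weights stay on the GPU.
offloaded = [n for n in names if should_offload(n, ["experts"])]
```

Because expert-weight naming is consistent across layers within a given model, a single substring like "experts" is enough to select every expert weight without regex.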
@wzhao18 It turns out your notation is the right one for
Purpose
This PR adds support for selectively offloading parameters to the CPU based on name matching. One use case is to offload only the expert weights of MoE models, which is useful for low-concurrency settings. This is enabled by passing the argument --cpu-offload-params.

Test Plan
Tested offloading Kimi K2 NVFP4 on one GB300.
Benchmarking single-user throughput:
Test Result
Before: 15 tok/s
Offloading MoE weights only: 31 tok/s (~2.1x speedup)